Bayesian optimization (BO) algorithms try to optimize an unknown function that is expensive to evaluate using a minimum number of evaluations/experiments. Most of the proposed algorithms in BO are sequential, where only one experiment is selected at each iteration. This method can be time inefficient when each experiment takes a long time and more than one experiment can be run concurrently. On the other hand, requesting a fixed-size batch of experiments at each iteration causes performance inefficiency in BO compared to the sequential policies. In this paper, we present an algorithm that asks for a batch of experiments at each time step $t$, where the batch size $p_t$ is dynamically determined in each step. Our algorithm is based on the observation that the sequence of experiments selected by the sequential policy can sometimes be almost independent from each other. Our algorithm identifies such scenarios and requests those experiments at the same time without degrading the performance. We evaluate our proposed method using the Expected Improvement policy, and the results show substantial speedup with little impact on the performance in eight real and synthetic benchmarks.



1 Introduction 

Bayesian Optimization (BO) algorithms try to optimize an unknown function $f(\cdot)$ by requesting a set of
experiments (evaluations of the function at requested points) when the function is costly to evaluate [6, 3].
In general, after selecting an experiment based on a selection criterion, we need to wait for the results of
the experiment to select the next experiment (based on a better prior). This is commonly referred to as a
sequential policy. The sequential framework is not efficient in applications where running an experiment is
time consuming and we have the ability to run multiple experiments concurrently.

Recently, Azimi et al. [?] introduced a batch BO approach that selects a batch of experiments at each
iteration that best approximates the behavior of a given sequential heuristic such as Expected Improvement
(EI). Their results show that batch selection in general performs worse than a sequential policy, especially
when the total number of experiments is small. This motivates us to introduce a dynamic batch BO approach
that selects a varying number of experiments at each time step. Our goal is to select experiments that are
likely to be selected by a given sequential policy. Specifically, at each step $t$, given a sequential policy
(EI in this paper), we look for the possibility that the next $p_t \ge 1$ samples $x_1, x_2, \ldots, x_{p_t}$ selected by EI are
approximately independent from each other. That is, EI is likely to select example $x_i$ in the $i$-th step regardless
of the outcome of the previous $i - 1$ experiments. Upon finding such a set of independent experiments, we
can ask for those experiments in a batch and hence speed up our experimental process. Note that we might not be
able to ask for more than one experiment at some iterations, since the next experiment might strongly depend
on the currently selected/running experiments.
2 Gaussian Process (GP)



Any BO algorithm consists of two main components: 1) the model of the unknown function, which generates
a posterior over the outputs of unobserved points in the input space, and 2) a selection criterion, which
chooses an experiment at each iteration based on the available prior. We use a Gaussian Process (GP) as our
primary model to build the posterior over the outcome values given our observations $\mathcal{O} = (x_O, y_O)$, where
$x_O = \{x_1, x_2, \ldots, x_n\}$ and $y_O = \{y_1, y_2, \ldots, y_n\}$ such that $y_i = f(x_i)$ and $f(\cdot)$ is our underlying function.

For a new input point $x_i$, GP models the output $y_i$ as a normal random variable $y_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$, with
$\mu_i = k(x_i, x_O)\, k(x_O, x_O)^{-1} y_O$ and $\sigma_i^2 = k(x_i, x_i) - k(x_i, x_O)\, k(x_O, x_O)^{-1} k(x_O, x_i)$, where $k(\cdot, \cdot)$ is an
arbitrary symmetric positive definite kernel function to compute the coherence between any pair of elements.
We use the squared exponential as our kernel function, $k(x, y) = \exp(-\frac{1}{l} \|x - y\|^2)$, where $\|\cdot\|$ is the vector
2-norm and $l$ is the constant length scale parameter. Our approach is flexible to the choice of kernel function.
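
The posterior formulas above can be illustrated with a short NumPy sketch. It is only a minimal illustration under stated assumptions: the function names, the length scale value 0.1, the small diagonal jitter added for numerical stability, and the toy observations are all hypothetical and do not come from the paper.

```python
import numpy as np

def sq_exp_kernel(X, Y, length_scale=0.1):
    """k(x, y) = exp(-||x - y||^2 / l) for every pair of rows of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / length_scale)

def gp_posterior(x_new, x_obs, y_obs, length_scale=0.1, jitter=1e-10):
    """Posterior mean and variance at each row of x_new (zero-mean prior)."""
    # jitter is an assumed numerical safeguard, not part of the formulas above
    A = sq_exp_kernel(x_obs, x_obs, length_scale) + jitter * np.eye(len(x_obs))
    P = sq_exp_kernel(x_new, x_obs, length_scale)
    mean = P @ np.linalg.solve(A, y_obs)
    # k(x, x) = 1 for this kernel, so the prior variance is 1 everywhere
    var = 1.0 - np.sum(P * np.linalg.solve(A, P.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

# Toy data (hypothetical): three observations of sin(3x) on [0, 1]
x_obs = np.array([[0.1], [0.4], [0.9]])
y_obs = np.sin(3.0 * x_obs[:, 0])
mean, var = gp_posterior(np.array([[0.4], [0.65]]), x_obs, y_obs)
```

At an already observed input the posterior mean reproduces the observed value and the variance collapses towards zero, while an unobserved input keeps a positive variance.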

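Definition 1. Let $x^* = \{x_1^*, \ldots, x_m^*\} \in \mathcal{X} \setminus x_O$ be any unobserved set of points. Let $y^* = \{y_1^*, y_2^*, \ldots, y_m^*\}$
be their corresponding predictions obtained from the Gaussian Process model learned from existing
observations, where $y^* | \mathcal{O} \sim \mathcal{N}(\mu^*, \sigma^{*2})$. Define $\sigma^{*2} = (\sigma_1^{*2}, \sigma_2^{*2}, \ldots, \sigma_m^{*2})$ and $\mu^* = (\mu_1^*, \mu_2^*, \ldots, \mu_m^*)$,
and for any point $z \in \mathcal{X} \setminus \{x_O \cup x^*\}$, let $p(y_z | z, \mathcal{O}) \sim \mathcal{N}(\mu_z, \sigma_z^2)$ and $p(y_z | z, \mathcal{O}, (x^*, y^*)) \sim \mathcal{N}(\mu_z^*, \sigma_z^{*2})$.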

Using GP as our primary model, the variance of any point $z$ depends only on the location of the observed
points and is independent of the observation outputs. Therefore, we can easily evaluate the variance of any
point $z$ after requesting experiments at any set of points $x^*$ without knowing their observation outputs. The
following theorem characterizes the variance of $z$ after sampling at $x^*$.

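Theorem 1. Let $\Delta^*(\sigma_z) := \sigma_z^2 - \sigma_z^{*2}$, then

$$\Delta^*(\sigma_z) = (P A^{-1} B^T - k_z^*)\, m\, (P A^{-1} B^T - k_z^*)^T, \tag{1}$$
where $B = k(x^*, x_O)$, $A = k(x_O, x_O)$, $P = k(z, x_O)$, $m = (k(x^*, x^*) - B A^{-1} B^T)^{-1}$ and $k_z^* = k(z, x^*)$.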
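
Equation (1) can be checked numerically. The following is a minimal sketch, assuming the squared exponential kernel with a hypothetical length scale of 0.1 and a small hand-picked set of 2-D points; it compares the closed form with a direct recomputation of the variance after adding $x^*$ to the inputs.

```python
import numpy as np

def k(X, Y, l=0.1):
    """Squared exponential kernel exp(-||x - y||^2 / l) between rows of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / l)

# Hypothetical, well separated points in [0, 1]^2
x_o = np.array([[0.1, 0.1], [0.1, 0.9], [0.9, 0.1],
                [0.9, 0.9], [0.5, 0.5], [0.3, 0.7]])   # observed inputs x_O
x_s = np.array([[0.7, 0.3], [0.5, 0.1], [0.2, 0.4]])   # pending batch x*
z = np.array([[0.6, 0.6]])                             # query point z

A, B, P = k(x_o, x_o), k(x_s, x_o), k(z, x_o)
D, k_z = k(x_s, x_s), k(z, x_s)
A_inv = np.linalg.inv(A)
m = np.linalg.inv(D - B @ A_inv @ B.T)

# Right-hand side of Eq. (1)
r = P @ A_inv @ B.T - k_z
delta_formula = (r @ m @ r.T).item()

# Direct computation: variance of z before and after adding x* to the inputs
var_before = (k(z, z) - P @ A_inv @ P.T).item()
x_all = np.vstack([x_o, x_s])
P_all = k(z, x_all)
var_after = (k(z, z) - P_all @ np.linalg.solve(k(x_all, x_all), P_all.T)).item()
delta_direct = var_before - var_after
```

Both quantities depend only on input locations, which is why no output values appear anywhere in the sketch.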

The expected output value $\mu_z$ of any point $z$, however, heavily depends on the outputs of points $x^*$, which
are not available. Below, we provide an upper bound on the change of $\mu_z$ after sampling at $x^*$.

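Theorem 2. Let $\Delta^*(\mu_z) = \mu_z^* - \mu_z$. Then

$$E_{y^*}\left[|\Delta^*(\mu_z)|\right] \le \left\|(P A^{-1} B^T - k_z^*)\, m\right\|_\infty \sqrt{\frac{2}{\pi}}\, \|\sigma^*\|_1, \tag{2}$$

where $\|\cdot\|_\infty$ and $\|\cdot\|_1$ are the vector infinity norm and 1-norm, respectively.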

Interestingly, the above stated upper bound on $E_{y^*}[|\Delta^*(\mu_z)|]$ can be considered as a function of the sum of $\sigma^*$
and is independent of the observation outputs. Therefore, $E_{y^*}[|\Delta^*(\mu_z)|]$ converges to zero as the number of
observations increases, since the variance decreases.
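
The bound on the expected mean change can be evaluated from input locations alone. The sketch below is an illustration under assumptions: the 1-D points and the length scale 0.05 are hypothetical, and the bound is taken in the form max-norm times $\sqrt{2/\pi}$ times the sum of the posterior standard deviations of $y^*$. As a reference it also computes the exact expectation, which is available in closed form here because the mean change is a linear function of the Gaussian vector $y^*$.

```python
import numpy as np

def k(X, Y, l=0.05):
    """Squared exponential kernel exp(-||x - y||^2 / l) between rows of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / l)

x_o = np.array([[0.05], [0.3], [0.55], [0.95]])   # observed inputs x_O (hypothetical)
x_s = np.array([[0.15], [0.75]])                  # pending batch x*
z = np.array([[0.8]])                             # candidate next point

A, B, P = k(x_o, x_o), k(x_s, x_o), k(z, x_o)
A_inv = np.linalg.inv(A)
cov_s = k(x_s, x_s) - B @ A_inv @ B.T             # posterior covariance of y* given O
m = np.linalg.inv(cov_s)
v = (P @ A_inv @ B.T - k(z, x_s)) @ m             # row vector v^T

sigma_s = np.sqrt(np.diag(cov_s))                 # posterior std. deviations sigma*
bound = np.max(np.abs(v)) * np.sqrt(2 / np.pi) * np.sum(sigma_s)

# Reference value: v^T u is zero-mean Gaussian with variance v cov_s v^T,
# so its expected absolute value is sqrt(2/pi) times its standard deviation.
exact = np.sqrt(2 / np.pi) * np.sqrt((v @ cov_s @ v.T).item())
```

The gap between `exact` and `bound` indicates how conservative the bound is for a particular configuration of points.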

3 Dynamic Batch Design for Bayesian Optimization 

In the sequential approach, we ask for only one experiment at each time step using a given policy $\pi$ (the
selection criterion). Suppose we have the capability of running $n_b$ experiments in parallel, and we are limited
by the total number of possible experiments $n_l$. At each time step $t$, the question is whether or not we
can select $1 \le p_t \le n_b$ samples to speed up the experimental procedure without losing any performance
compared to the sequential policy $\pi$. In [2], the authors tried to select a batch of $k$ examples by predicting the
samples selected at the next $k$ steps by a given policy $\pi$ using Monte Carlo simulation. However, the
predicted set of experiments is frequently not identical to the set selected by the sequential policy (e.g.
EI), especially when the batch size $k$ is large. In this section, we introduce an algorithm that selects $p_t \ge 1$
experiments at each iteration based on (but not restricted to) the EI policy, where the batch size $p_t$ is
dynamically determined at each step. In particular, it will select more than one experiment when/if the next
set of experiments to be selected by $\pi$ is believed to be independent of the outcome of the already selected
experiments. Algorithm 1 is our proposed solution to this problem. We focus on the problem of finding the
maximizer of an unknown function for the rest of the paper; however, the result can be easily extended to
minimization applications.
Definition 2. Expected Improvement (EI) at any point $x$ with corresponding GP prediction $y \sim \mathcal{N}(\mu_x, \sigma_x^2)$
is defined to be

$$EI(x) = \sigma_x \left(-u\, \Phi(-u) + \phi(u)\right) \tag{3}$$

where $u = (y_{max} - \mu_x)/\sigma_x$ and $y_{max} = \max_i y_i$. Also, $\Phi(\cdot)$ and $\phi(\cdot)$ represent the standard Gaussian
distribution and density functions, respectively.
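
Definition 2 translates directly into code. The sketch below is a minimal, standard-library-only illustration; the function names and the handling of a zero standard deviation are assumptions made for this example.

```python
import math

def std_normal_pdf(u):
    """Standard Gaussian density phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(u):
    """Standard Gaussian distribution function Phi(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def expected_improvement(mu, sigma, y_max):
    """EI(x) = sigma * (-u * Phi(-u) + phi(u)), with u = (y_max - mu) / sigma."""
    if sigma <= 0.0:
        # Assumed convention: with no uncertainty the improvement is deterministic
        return max(mu - y_max, 0.0)
    u = (y_max - mu) / sigma
    return sigma * (-u * std_normal_cdf(-u) + std_normal_pdf(u))

ei_at_best = expected_improvement(mu=1.0, sigma=0.5, y_max=1.0)   # u = 0
ei_below = expected_improvement(mu=0.5, sigma=0.5, y_max=1.0)     # u = 1
```

With the variance held fixed, the point whose mean equals the current best has a larger EI than the point whose mean lies below it.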

In general, we have to predict the experiments selected by EI for a sequence of consecutive time steps.
Clearly, we are certain about the first experiment $x_1^*$, and we are interested in the EI value of the other
points after sampling at point $x_1^*$. Therefore, the goal is to predict the EI values of other points after sampling
at $x_1^*$. After selecting the sample $x_1^*$, we can calculate the variance of the output of any point $z$
regardless of the output value of $x_1^*$; but the expectation of the output corresponding to $z$ can highly depend
on the output value of $x_1^*$. Therefore, we cannot evaluate the exact EI of any point $z$ before running the
real experiment and observing the output. However, we can calculate an upper bound on the EI of the next
experiments. Below, we will show that the EI upper bound of all points in the space can be computed by
simply setting the output value of $x_1^*$ to $M$, where $M$ is the maximum possible output value (which is easy
to estimate or is given in many applications).

Theorem 3. Fixing the variance $\sigma_x^2$, EI is an increasing function of $\mu_x$.

As a consequence of this theorem, by setting the output value of $x_1^*$ to the maximum $M$, we set an upper
bound for the expectation of any point $z$ whose mean and variance are changed after sampling at point $x_1^*$.
Therefore, the EI of any point $z$ based on the current observation $\mathcal{O} \cup (x_1^*, M)$ can be considered as the EI
upper bound value after observing the output value of $x_1^*$. Thus, we select the next sample $z$ based on the current
observation $\mathcal{O} \cup (x_1^*, M)$. If the next selected sample $z$ satisfies $E[|\Delta^*(\mu_z)|] \le \epsilon$, then the expectation of its
output changes at most by $\epsilon$. For small values of $\epsilon$, we can consider its expectation as constant before and
after sampling $x_1^*$, and hence $z$ is independent of $x_1^*$, experiment-wise. This entails that the point $z$ is likely to
be the next experiment selected by the EI policy, since its EI value is more than the EI upper bound of the points
affected by sampling $x_1^*$.



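Algorithm 1 Dynamic Batch Expected Improvement Algorithm

Input: Total budget of experiments ($n_l$), maximum batch size ($n_b$), the maximum observation value ($M$),
current observation $\mathcal{O} = (x_O, y_O)$ and stopping threshold $\epsilon$.
while $n_l > 0$ do
    $x_1^* \leftarrow \arg\max_{x \in \mathcal{X}} EI(x | \mathcal{O})$
    $A \leftarrow \{(x_1^*, M)\}$, $n_l \leftarrow n_l - 1$
    $z \leftarrow \arg\max_{x \in \mathcal{X}} EI(x | \mathcal{O} \cup A)$
    while ($n_l > 0$) and ($|A| < n_b$) and ($E[|\Delta^*(\mu_z)|] \le \epsilon$) do
        $A \leftarrow A \cup \{(z, M)\}$, $n_l \leftarrow n_l - 1$
        $z \leftarrow \arg\max_{x \in \mathcal{X}} EI(x | \mathcal{O} \cup A)$
    end while
    $y^* \leftarrow$ RunExperiment($x^*$)
end while
return $\max(y_O)$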



This algorithm is based on two main observations: (1) the selected sample $z$ is far enough from $x_1^*$ that its
expectation does not change after sampling at $x_1^*$; (2) the EI upper bound of the samples affected by sampling
at $x_1^*$ is less than the EI of the point $z$. Therefore, the point $z$ would likely be the next selected point for a small
value of $\epsilon$. This procedure is repeated until the next sample has $E[|\Delta^*(\mu_z)|] > \epsilon$, indicating that we cannot
find the next independent point.
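
The procedure of Algorithm 1 can be sketched in Python on a finite 1-D candidate grid. This is an illustrative sketch under several assumptions, not the authors' implementation: the independence test uses the upper bound of Theorem 2 in place of the exact expectation, $y_{max}$ in EI is taken from real observations only, already used points are excluded from the candidates, and the objective `f`, the grid, the kernel width and all parameter values are hypothetical.

```python
import numpy as np
from math import erf, sqrt, pi

def k(X, Y, l=0.02):
    """Squared exponential kernel exp(-(x - y)^2 / l) on 1-D inputs."""
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / l)

def posterior(grid, x, y, jitter=1e-8):
    """GP posterior mean and standard deviation on `grid` given (x, y)."""
    A = k(x, x) + jitter * np.eye(len(x))
    P = k(grid, x)
    mu = P @ np.linalg.solve(A, y)
    var = 1.0 - np.sum(P * np.linalg.solve(A, P.T).T, axis=1)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def ei(mu, sd, y_max):
    """Eq. (3): sd * (-u * Phi(-u) + phi(u)) with u = (y_max - mu) / sd."""
    u = (y_max - mu) / sd
    Phi_neg_u = 0.5 * (1.0 + np.vectorize(erf)(-u / sqrt(2.0)))
    return sd * (-u * Phi_neg_u + np.exp(-0.5 * u**2) / sqrt(2.0 * pi))

def mean_shift_bound(z, x, pending, jitter=1e-8):
    """Right-hand side of Eq. (2) for one candidate z and pending points x*."""
    A_inv = np.linalg.inv(k(x, x) + jitter * np.eye(len(x)))
    B, P = k(pending, x), k(np.array([z]), x)
    cov = k(pending, pending) - B @ A_inv @ B.T + jitter * np.eye(len(pending))
    v = (P @ A_inv @ B.T - k(np.array([z]), pending)) @ np.linalg.inv(cov)
    sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
    return np.max(np.abs(v)) * sqrt(2.0 / pi) * np.sum(sigma)

def dynamic_batch_ei(f, grid, init_idx, n_l, n_b, M, eps):
    idx = list(init_idx)                       # grid indices observed so far
    y = [f(grid[i]) for i in idx]
    batch_sizes = []
    while n_l > 0:
        batch = []                             # indices of pending points (set A)
        while n_l > 0 and len(batch) < n_b:
            # Condition on real data plus pending points fantasized at M
            used = idx + batch
            y_used = np.array(y + [M] * len(batch))
            mu, sd = posterior(grid, grid[used], y_used)
            score = ei(mu, sd, max(y))         # y_max from real observations only
            score[used] = -np.inf              # never re-select a used point
            z = int(np.argmax(score))
            if batch and mean_shift_bound(grid[z], grid[idx], grid[batch]) > eps:
                break                          # z depends on pending outcomes
            batch.append(z)
            n_l -= 1
        y += [f(grid[i]) for i in batch]       # RunExperiment on the whole batch
        idx += batch
        batch_sizes.append(len(batch))
    return max(y), batch_sizes

f = lambda x: float(x * np.sin(6.0 * x))       # hypothetical objective
grid = np.linspace(0.0, 1.0, 51)
best, batch_sizes = dynamic_batch_ei(f, grid, [5, 25, 45],
                                     n_l=12, n_b=5, M=0.4, eps=0.02)
```

The first point of each batch is always accepted, so every time step runs at least one experiment, and `batch_sizes` records how many experiments were requested together at each step.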

3.1 Other Sequential Approaches 

The proposed approach can be extended to other sequential policies such as Maximum Mean (MM), Maximum
Upper Interval (MUI) and Maximum Probability of Improvement (MPI) [?]; however, the rate of
the speedup would be significantly different among these approaches. For example, in MM with the objective
function $x^* = \arg\max_{x \in \mathcal{X}} \mu_{x|\mathcal{O}}$, we would not have any speedup since we are only interested in
the point with the highest mean. Therefore, by setting the output of the selected point $x^*$ to the highest
possible value, the next selected sample would be one of the closest samples to $x^*$. For the MUI policy with
the objective function $x^* = \arg\max_{x \in \mathcal{X}} \mu_{x|\mathcal{O}} + 1.96\, \sigma_{x|\mathcal{O}}$, we expect a high rate of speedup since it is
dominated by the variance, which is completely known after sampling at $x^*$. For MPI with the objective function
$x^* = \arg\max_{x \in \mathcal{X}} \Phi(\ldots)$, the speedup varies with the parameter $\alpha$.

Table 1: Benchmark Functions

Function          Mathematical formulation
Cosines(2)        $1 - (u^2 + v^2 - 0.3\cos(3\pi u) - 0.3\cos(3\pi v))$, with $u = 1.6x - 0.5$, $v = 1.6y - 0.5$
Rosenbrock(2)     $10 - 100(y - x^2)^2 - (1 - x)^2$
Hartman(3,6)      $\sum_{i=1}^{4} \alpha_i \exp\left(-\sum_{j=1}^{d} A_{ij}(x_j - P_{ij})^2\right)$; $\alpha_{1 \times 4}$, $A_{4 \times d}$, $P_{4 \times d}$ are constants
Michalewicz(5)    $-\sum_{i=1}^{5} \sin(x_i)\, \sin\left(\frac{i\, x_i^2}{\pi}\right)^{20}$
Shekel(4)         $\sum_{i=1}^{10} \frac{1}{\alpha_i + \sum_{j=1}^{4} (x_j - A_{ji})^2}$; $\alpha_{1 \times 10}$, $A_{4 \times 10}$ are constants

Table 2: Benchmark Performance

                          Cosines  Hydrog  FC      Rosen   Hartman 3  Michal  Shekel  Hartman 6
Sequential                0.145    0.026   0.155   0.005   0.033      0.369   0.340   0.222
Semi($M$)                 0.148    0.026   0.151   0.005   0.036      0.379   0.335   0.220
Speedup($M$)              6.8%     9.3%    10.2%   16.3%   10.1%      6.0%    2.1%    4.5%
Semi($\alpha = 0.1$)      0.147    0.027   0.160   0.006   0.034      0.375   0.345   0.233
Speedup($\alpha = 0.1$)   16.23%   11.4%   11.7%   17.8%   18.4%      16.7%   17.4%   13.77%

4 Experimental Results 

As we mentioned earlier, we use GP to build a posterior over our unknown function $f(\cdot)$. Our GP is based on
a zero-mean prior and the squared exponential covariance kernel function with kernel width $l = 0.01 \sum_{i=1}^{d} l_i$, with
$l_i$ being the length of the $i$-th dimension [?]. We consider six well-known synthetic benchmarks: Cosines and
Rosenbrock [?, ?] over $[0, 1]^2$, Hartman($i$) [?] over $[0, 1]^i$ for $i = 3, 6$, Shekel [?] over $[3, 6]^4$ and Michalewicz
[?] over $[0, \pi]^5$. The mathematical formulations of these functions are listed in Table 1. We also evaluate our
approach on two real benchmarks, Fuel Cell and Hydrogen, over $[0, 1]^2$. More details about these benchmarks
can be found in [2].

We run our algorithm on each benchmark 100 times and report the average regret, $M - \max_{y_i \in y_O} y_i$. In each
run, the algorithm starts with 5 initial random points for the 2- and 3-dimensional benchmarks and 20 initial random
points for the higher dimensional benchmarks. The number of experiments $n_l$ is set to 20 for the 2- and 3-dimensional
benchmarks and 60 for the higher dimensional benchmarks, and the maximum batch size at each iteration, $n_b$, is set to 5.
The parameter $\epsilon$ is set to 0.02 for the 2- and 3-dimensional benchmarks and 0.2 for the higher dimensional benchmarks.

Table 2 shows the performance of our algorithm on each benchmark along with the performance of the
sequential policy. There are two different sets of results for each benchmark, with different values of $M$ and
$\alpha$. Since setting the value of any selected sample $z$ to a fixed value $M$ does not match reality, we consider
an improvement over the best current observed output, $y_{max}$, as a surrogate for the fixed $M$; i.e., in each step,
instead of $M$, we use the value $(1 + \alpha)\, y_{max}$. In this experiment, we set the value of $\alpha$ to 0.1, which means
we expect a 10% improvement over the best current observation after each sampling.

To illustrate the speedup over the sequential policy, we calculate the speedup as the percentage of the samples
in the whole experiment that are selected to be run in parallel in the batch mode. In general, if we exhaust $n_l$
samples in $T$ steps, the speedup is calculated as $(n_l - T)/n_l$. Clearly, the maximum speedup in our setting
is 0.8, which can only happen if we select 5 experiments at each and every time step. The results show that, in
both settings, we perform very close to the sequential policy and we achieve a maximum speedup of 16.3%
with fixed $M$ on Rosenbrock and 18.4% with varying $M$ by $\alpha = 0.1$ on Hartman 3.
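
The speedup measure can be made concrete with a few lines of Python. The helper name and the example batch-size sequences are hypothetical; only the formula $(n_l - T)/n_l$ and the setting $n_b = 5$ come from the text.

```python
def speedup(batch_sizes):
    """Speedup (n_l - T) / n_l for one run, given the batch size at each step."""
    n_l = sum(batch_sizes)      # total number of experiments
    T = len(batch_sizes)        # number of time steps
    return (n_l - T) / n_l

max_speedup = speedup([5, 5, 5, 5])     # n_l = 20 with a full batch of 5 at every step
no_speedup = speedup([1] * 20)          # purely sequential run
mixed = speedup([1, 2, 1, 3, 1, 2])     # hypothetical dynamic run: n_l = 10, T = 6
```

A full batch at every step gives $(20 - 4)/20 = 0.8$, a purely sequential run gives 0, and the mixed run gives $(10 - 6)/10 = 0.4$.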

Figure 4 shows the speedup ratio versus the number of experiments in our 2- and 3-dimensional benchmarks. It
can be seen that as we increase the number of experiments, the speedup ratio increases. Experimental results
show that, when $n_l > 20$, we usually ask for $p_t > 1$ points at each iteration, and as we get closer to 60, we
usually reach the maximum possible batch size of $n_b = 5$. The speedup ratio for Rosenbrock is significantly
different from the other benchmarks since it has only one local/global maximum.

[Figure: speedup ratio versus # of Experiments (horizontal axis from 10 to 60)]
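A Proof of Theorem 1

Recalling the notation introduced in the theorem statement, and writing $D = k(x^*, x^*)$, we have

$$\sigma_z^2 - \sigma_z^{*2} = [P \;\; k_z^*] \begin{bmatrix} A & B^T \\ B & D \end{bmatrix}^{-1} \begin{bmatrix} P^T \\ k_z^{*T} \end{bmatrix} - P A^{-1} P^T$$

$$= [P \;\; k_z^*] \begin{bmatrix} A^{-1} + A^{-1} B^T m B A^{-1} & -A^{-1} B^T m \\ -m B A^{-1} & m \end{bmatrix} \begin{bmatrix} P^T \\ k_z^{*T} \end{bmatrix} - P A^{-1} P^T$$

$$= (P A^{-1} B^T - k_z^*)\, m\, (B A^{-1} P^T - k_z^{*T}). \tag{4}$$

This concludes the proof of the theorem.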



B Proof of Theorem 2 

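Employing the notation of Theorem 1, with $D = k(x^*, x^*)$, we have

$$|\Delta^*(\mu_z)| = \left| P A^{-1} y_O - [P \;\; k_z^*] \begin{bmatrix} A & B^T \\ B & D \end{bmatrix}^{-1} \begin{bmatrix} y_O \\ y^* \end{bmatrix} \right|$$

$$= \left| P A^{-1} y_O - [P \;\; k_z^*] \begin{bmatrix} A^{-1} + A^{-1} B^T m B A^{-1} & -A^{-1} B^T m \\ -m B A^{-1} & m \end{bmatrix} \begin{bmatrix} y_O \\ y^* \end{bmatrix} \right|$$

$$= \left| (P A^{-1} B^T - k_z^*)\, m\, (B A^{-1} y_O - y^*) \right|. \tag{5}$$

Notice that by Hölder's inequality, for a fixed vector $v^T = (P A^{-1} B^T - k_z^*)\, m$ and random vector
$u = B A^{-1} y_O - y^*$, we have

$$E\left[|v^T u|\right] \le \left(\max_i |v_i|\right) \sum_i E\left[|u_i|\right] = \left\|(P A^{-1} B^T - k_z^*)\, m\right\|_\infty \sqrt{\frac{2}{\pi}}\, \|\sigma^*\|_1. \tag{6}$$

The last equality holds because $y^* | \mathcal{O} \sim \mathcal{N}(\mu^*, \sigma^{*2})$ with $\mu^* = B A^{-1} y_O$, so each $u_i$ is a zero-mean Gaussian with standard deviation $\sigma_i^*$ and $E[|u_i|] = \sqrt{2/\pi}\, \sigma_i^*$.

This concludes the proof of the theorem.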



C Proof of Theorem 3 

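Taking the derivative with respect to $\mu_x$ and using the chain rule, we get

$$\frac{\partial}{\partial \mu_x} EI(x) = \frac{\partial u}{\partial \mu_x}\, \frac{\partial}{\partial u}\, \sigma_x \left(-u\, \Phi(-u) + \phi(u)\right)$$

$$= -\frac{1}{\sigma_x}\, \sigma_x \left(-\Phi(-u) + u\, \phi(-u) - u\, \phi(u)\right)$$

$$= \Phi(-u) > 0. \tag{7}$$

This concludes the proof of the theorem.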